Abstract

Ensemble learning of K nonlinear perceptrons, which determine their outputs by sign functions, 
is discussed within the framework of online learning and statistical mechanics. One purpose of 
statistical learning theory is to theoretically obtain the generalization error. This paper shows 
that ensemble generalization error can be calculated by using two order parameters, that is, the 
similarity between a teacher and a student, and the similarity among students. The differential 
equations that describe the dynamical behaviors of these order parameters are derived in the case of 
general learning rules. The concrete forms of these differential equations are derived analytically in 
the cases of three well-known rules: Hebbian learning, perceptron learning and AdaTron learning. 
Ensemble generalization errors of these three rules are calculated by using the results determined 
by solving their differential equations. As a result, these three rules show different characteristics 
in their affinity for ensemble learning, that is "maintaining variety among students." Results show 
that AdaTron learning is superior to the other two rules with respect to that affinity.

I. INTRODUCTION

Ensemble learning has recently attracted the attention of many researchers.
Ensemble learning combines many rules or learning machines (called students in the
following) that individually perform poorly. Theoretical studies that analyze its generalization
performance by using statistical mechanics have been performed vigorously.

Hara and Okada [4] theoretically analyzed the case in which students are linear perceptrons.
Their analysis was performed with statistical mechanics, focusing on the fact that
the output of a new perceptron, whose connection weight is equivalent to the mean of those
of the students, is identical to the mean output of the students. Krogh and Sollich analyzed
ensemble learning of linear perceptrons with noise within the framework of batch learning.
They showed that the generalization performance can be optimized by choosing the best
size of the learning samples in the large K limit, where K is the number of students, and that
the generalization performance can be improved by dividing the learning samples in the noisy
situation when K is finite.

On the other hand, Hebbian learning, perceptron learning and AdaTron learning are
well known as learning rules for a nonlinear perceptron, which decides its output by a sign
function. Urbanczik analyzed ensemble learning of nonlinear perceptrons
that decide their outputs by sign functions in the large K limit within the framework of
online learning. He treated a generalized learning rule that he termed a "soft version
of perceptron learning," which includes both Hebbian learning and perceptron learning as
special cases, and discussed it from the viewpoint of generalization error. As a result, he
showed that although an ensemble usually performs better than a single student, an
ensemble has no special advantage in the optimized case within the framework of the soft
version of perceptron learning. He thus considered only a limiting case of ensemble learning.

Though Urbanczik discussed ensemble learning of nonlinear perceptrons within the framework
of online learning, he treated only the case in which the number K of students is large
enough. Determining the differences among ensemble learning with Hebbian learning, perceptron
learning and AdaTron learning (three typical learning rules) is a very attractive
problem, but it is one that has never been analyzed, to the best of our knowledge.

Based on these past studies, we discuss ensemble learning of K nonlinear perceptrons,
which decide their outputs by sign functions, within the framework of online learning and
finite K. First, we show that the ensemble generalization error of K students can
be calculated by using two order parameters: one is the similarity between the teacher and a
student, the other is the similarity among students. Next, we derive differential equations that
describe the dynamical behaviors of these order parameters in the case of general learning rules.
After that, we derive concrete differential equations for three well-known learning rules:
Hebbian learning, perceptron learning and AdaTron learning. We calculate the ensemble
generalization errors by using the results obtained by solving these equations numerically.
Two methods are treated to decide an ensemble output. One is the majority vote of students,
and the other is the output of a new perceptron whose connection weight equals the mean
of those of the students. As a result, we show that these three learning rules have different
properties with respect to their affinity for ensemble learning, and that AdaTron learning, which
is known to have the best asymptotic property [12], is the best among the three
learning rules within the framework of ensemble learning.



II. MODEL 



Each student treated in this paper is a perceptron that decides its output by a sign
function. An ensemble of $K$ students is considered. The connection weights of the students are
$J_1, J_2, \cdots, J_K$. Here $J_k = (J_{k1}, \cdots, J_{kN})$, $k = 1, 2, \cdots, K$, and the input
$x = (x_1, \cdots, x_N)$ are $N$-dimensional vectors. Each component $x_i$ of $x$ is assumed to be an
independent random variable that obeys the Gaussian distribution $\mathcal{N}(0, 1/N)$. Each
component of $J_k^0$, that is, the initial value of $J_k$, is assumed to be generated independently
according to the Gaussian distribution $\mathcal{N}(0, 1)$. Thus,

$$\langle x_i \rangle = 0, \quad \langle (x_i)^2 \rangle = \frac{1}{N}, \tag{1}$$

$$\langle J_{ki}^0 \rangle = 0, \quad \langle (J_{ki}^0)^2 \rangle = 1, \tag{2}$$

where $\langle \cdot \rangle$ denotes the average. Each student's output is
$\mathrm{sgn}(u_1 l_1), \mathrm{sgn}(u_2 l_2), \cdots, \mathrm{sgn}(u_K l_K)$, where

$$\mathrm{sgn}(u_k l_k) = \begin{cases} +1, & u_k l_k \geq 0, \\ -1, & u_k l_k < 0, \end{cases} \tag{3}$$

$$u_k l_k = J_k \cdot x. \tag{4}$$

Here, $l_k$ denotes the length of student $J_k$. This is one of the order parameters treated in this
paper and will be described in detail later. In this paper, $u_k$ is called the normalized internal
potential of a student.

The teacher is also a perceptron that decides its output by a sign function. The teacher's
connection weight is $B$. In this paper, $B$ is assumed to be fixed, where $B = (B_1, \cdots, B_N)$
is also an $N$-dimensional vector. Each component $B_i$ is assumed to be generated independently
according to the Gaussian distribution $\mathcal{N}(0, 1)$. Thus,

$$\langle B_i \rangle = 0, \quad \langle (B_i)^2 \rangle = 1. \tag{5}$$

The teacher's output is $\mathrm{sgn}(v)$, where

$$v = B \cdot x. \tag{6}$$

Here, $v$ represents the internal potential of the teacher. For simplicity, the connection weight
of a student and that of the teacher are simply called student and teacher, respectively.
In this paper the thermodynamic limit $N \to \infty$ is also treated. Therefore,

$$|x| = 1, \quad |B| = \sqrt{N}, \quad |J_k^0| = \sqrt{N}, \tag{7}$$

where $|\cdot|$ denotes a vector norm. Generally, the norm $|J_k|$ of a student changes as the time step
proceeds. Therefore, the ratio $l_k$ of the norm to $\sqrt{N}$ is considered and is called the length of
student $J_k$. That is,

$$|J_k| = l_k \sqrt{N}, \tag{8}$$

where $l_k$ is one of the order parameters treated in this paper.

The common input $x$ is presented to the teacher and all students in the same order.
Each student compares its output with the output of the teacher for input $x$. Each student's
connection weight is then corrected so as to increase the probability that the student's output
agrees with that of the teacher. This procedure is called learning, and a method of learning is
called a learning rule, of which Hebbian learning, perceptron learning and AdaTron learning
are well-known examples [12]. Within the framework of online learning, the only information
that can be used for the correction, other than that regarding the student itself, is the input $x$
and the output of the teacher for that input. Therefore, the update can be expressed as follows,

$$J_k^{m+1} = J_k^m + f_k^m x^m, \tag{9}$$

$$f_k^m = f(\mathrm{sgn}(v^m), u_k^m), \tag{10}$$

where $m$ denotes the time step, and $f$ is a function determined by the learning rule.
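The update of Eqs. (9) and (10) can be sketched numerically as follows. This is a minimal illustration, not the simulation code used for the figures: the three functions `f_hebbian`, `f_perceptron` and `f_adatron` transcribe the rules that the paper defines later in Eqs. (63), (76) and (84), while the names `online_step` and `simulate`, the default sizes and the random seed are assumptions made here.

```python
import numpy as np

def f_hebbian(sgn_v, u):
    # Eq. (63): f = sgn(v), the same for every student
    return np.full_like(u, sgn_v)

def f_perceptron(sgn_v, u):
    # Eq. (76): f = Theta(-u v) sgn(v); nonzero only when the student errs
    return np.where(-u * sgn_v >= 0, sgn_v, 0.0)

def f_adatron(sgn_v, u):
    # Eq. (84): f = -u Theta(-u v)
    return np.where(-u * sgn_v >= 0, -u, 0.0)

def online_step(J, B, x, f):
    """Apply Eqs. (9)-(10) once to all K students. J has shape (K, N)."""
    N = J.shape[1]
    l = np.linalg.norm(J, axis=1) / np.sqrt(N)   # length l_k, Eq. (8)
    u = (J @ x) / l                              # normalized potential, Eq. (4)
    sgn_v = 1.0 if B @ x >= 0 else -1.0          # teacher output, Eq. (6)
    return J + f(sgn_v, u)[:, None] * x[None, :]

def simulate(f, K=3, N=500, t_max=5.0, seed=0):
    """Run m = t_max * N online steps; return direction cosines R_k and q_kk'."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal(N)                   # teacher, components ~ N(0, 1)
    J = rng.standard_normal((K, N))              # initial students, ~ N(0, 1)
    for _ in range(int(t_max * N)):
        x = rng.standard_normal(N) / np.sqrt(N)  # fresh input, ~ N(0, 1/N)
        J = online_step(J, B, x, f)
    norms = np.linalg.norm(J, axis=1)
    R = (J @ B) / (norms * np.linalg.norm(B))
    q = (J @ J.T) / np.outer(norms, norms)
    return R, q

if __name__ == "__main__":
    R, q = simulate(f_hebbian)
    print("R_k:", R, " q_12:", q[0, 1])
```

Each input is drawn once and then discarded, which is what distinguishes online learning from batch learning in this setting.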

In this paper, two methods are treated to determine an ensemble output. One is the
majority vote of the K students, which means that the ensemble output is +1 if the students
whose outputs are +1 outnumber the students whose outputs are -1, and -1 in the
opposite case.

Another method for deciding an ensemble output is adopting the output of a new perceptron
whose connection weight is the mean of the weights of the K students. This method is
simply called the weight mean in this paper.
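The two ways of forming an ensemble output can be written down directly. The sketch below is illustrative only; the function names are hypothetical, and it assumes an odd number of students so that the majority vote cannot tie (a tie would be resolved to +1 by the sign convention of Eq. (3)).

```python
import numpy as np

def sgn(z):
    # Eq. (3): +1 for z >= 0, -1 otherwise
    return np.where(z >= 0, 1.0, -1.0)

def majority_vote(J, x):
    """Sign of the summed student outputs. J: (K, N) weights, x: (N,) input."""
    return float(sgn(np.sum(sgn(J @ x))))

def weight_mean(J, x):
    """Output of one perceptron whose weight is the mean of the K student weights."""
    return float(sgn(J.mean(axis=0) @ x))

if __name__ == "__main__":
    # Two students with small positive potentials, one with a large negative one.
    J = np.array([[1.0, 0.0], [0.0, 1.0], [-3.0, -3.0]])
    x = np.array([1.0, 0.5])
    print(majority_vote(J, x), weight_mean(J, x))
```

In this toy input the two methods disagree: two of the three students vote +1, but the third student's large weight dominates the mean weight vector.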



III. THEORY 



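In this paper, the majority vote and the weight mean are treated to determine an ensemble
output. We use

$$\epsilon = \Theta\left(-\mathrm{sgn}(B \cdot x)\, \mathrm{sgn}\left(\sum_{k=1}^{K} \mathrm{sgn}(J_k \cdot x)\right)\right), \tag{11}$$

and

$$\epsilon = \Theta\left(-\mathrm{sgn}(B \cdot x)\, \mathrm{sgn}\left(\sum_{k=1}^{K} J_k \cdot x\right)\right), \tag{12}$$

as the error $\epsilon$ for the majority vote and the weight mean, respectively. Here, $\epsilon$, $x$ and $J_k$
denote $\epsilon^m$, $x^m$ and $J_k^m$, respectively; the superscripts $m$, which represent time steps, are
omitted for simplicity. $\Theta(\cdot)$ is the step function defined as

$$\Theta(z) = \begin{cases} +1, & z \geq 0, \\ 0, & z < 0. \end{cases} \tag{13}$$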

In both cases, $\epsilon = 0$ if the ensemble output agrees with that of the teacher and $\epsilon = 1$
otherwise. The generalization error $\epsilon_g$ is defined as the average of the error $\epsilon$ over the
probability distribution $p(x)$ of input $x$. It can be regarded as the probability
that the ensemble output disagrees with that of the teacher for a new input $x$. One purpose
of statistical learning theory is to obtain the generalization error theoretically. In the case of the
majority vote, using Eqs. (4), (6) and (11), we obtain

$$\epsilon = \Theta\left(-\mathrm{sgn}(v)\, \mathrm{sgn}\left(\sum_{k=1}^{K} \mathrm{sgn}(u_k)\right)\right). \tag{14}$$

In the case of the weight mean, using Eqs. (4), (6) and (12), and taking the lengths $l_k$ of all
students to be equal (as justified by the symmetry among students used in Sec. IV), we obtain

$$\epsilon = \Theta\left(-\mathrm{sgn}(v)\, \mathrm{sgn}\left(\sum_{k=1}^{K} u_k\right)\right). \tag{15}$$

That is, the error $\epsilon$ can be described as $\epsilon = \epsilon(\{u_k\}, v)$ by using the normalized internal
potentials $u_k$ of the students and the internal potential $v$ of the teacher in both cases. Therefore, the
generalization error $\epsilon_g$ can also be described as

\begin{align}
\epsilon_g &= \int dx\, p(x)\, \epsilon \nonumber \\
&= \int \prod_{k=1}^{K} du_k\, dv\, p(\{u_k\}, v)\, \epsilon(\{u_k\}, v), \tag{16}
\end{align}

by using the probability distribution $p(\{u_k\}, v)$ of $u_k$ and $v$. From Eq. (4), we can write

$$u_k = \frac{1}{l_k} J_k \cdot x = \frac{1}{l_k} \sum_{i=1}^{N} J_{ki} x_i, \tag{17}$$

where $J_{ki} x_i$, $i = 1, \cdots, N$ are independent and identically distributed random variables. In
the same manner, from Eq. (6) we can write

$$v = \sum_{i=1}^{N} B_i x_i, \tag{18}$$

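where $B_i x_i$, $i = 1, \cdots, N$ are independent and identically distributed random variables.
Since the thermodynamic limit $N \to \infty$ is also considered in this paper, $u_k$ and $v$ obey a
multivariate Gaussian distribution by the central limit theorem. The discussion in this
paper falls within the framework of online learning, which means that an input $x$, once used for
an update, is discarded and $x$ for each time step is generated according to the Gaussian
distribution of Eq. (1). Therefore, since an input $x$ and a student $J_k$ have no correlation
with each other, from Eq. (17), the mean and the variance of $u_k$ are

\begin{align}
\langle u_k \rangle &= \left\langle \frac{1}{l_k} J_k \cdot x \right\rangle \tag{19} \\
&= \frac{1}{l_k} \left\langle \sum_{i=1}^{N} J_{ki} x_i \right\rangle \tag{20} \\
&= \frac{1}{l_k} \sum_{i=1}^{N} \langle J_{ki} \rangle \langle x_i \rangle \tag{21} \\
&= 0, \tag{22} \\
\langle (u_k)^2 \rangle &= \left\langle \left( \frac{1}{l_k} J_k \cdot x \right)^2 \right\rangle \tag{23} \\
&= \frac{1}{l_k^2} \left\langle \sum_{i=1}^{N} J_{ki} x_i \sum_{j=1}^{N} J_{kj} x_j \right\rangle \tag{24} \\
&= \frac{1}{l_k^2} \sum_{i=1}^{N} \langle (J_{ki})^2 \rangle \langle (x_i)^2 \rangle \tag{25} \\
&= 1, \tag{26}
\end{align}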



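respectively. In the same manner, since an input $x$ and the teacher $B$ have no correlation
with each other, from Eq. (18), the mean and the variance of $v$ are

\begin{align}
\langle v \rangle &= \langle B \cdot x \rangle \tag{27} \\
&= \left\langle \sum_{i=1}^{N} B_i x_i \right\rangle \tag{28} \\
&= \sum_{i=1}^{N} \langle B_i \rangle \langle x_i \rangle \tag{29} \\
&= 0, \tag{30} \\
\langle v^2 \rangle &= \langle (B \cdot x)^2 \rangle \tag{31} \\
&= \left\langle \sum_{i=1}^{N} B_i x_i \sum_{j=1}^{N} B_j x_j \right\rangle \tag{32} \\
&= \sum_{i=1}^{N} \langle (B_i)^2 \rangle \langle (x_i)^2 \rangle \tag{33} \\
&= 1, \tag{34}
\end{align}

respectively.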

From these, all diagonal components of the covariance matrix $\Sigma$ of $p(\{u_k\}, v)$ equal unity.
Let us discuss the direction cosines between connection weights as preparation for obtaining
the off-diagonal components. First, $R_k$ is defined as the direction cosine between the teacher $B$
and a student $J_k$. That is,

$$R_k \equiv \frac{B \cdot J_k}{|B||J_k|} = \frac{1}{l_k N} \sum_{i=1}^{N} B_i J_{ki}. \tag{35}$$

When the teacher $B$ and a student $J_k$ have no correlation, $R_k = 0$, and $R_k = 1$ when the
directions of $B$ and $J_k$ agree. Therefore, $R_k$ is called the similarity (or overlap)
between teacher and student in the following. Furthermore, $R_k$ is the second order parameter
treated in this paper. Next, $q_{kk'}$ is defined as the direction cosine between a student $J_k$ and
another student $J_{k'}$. That is,

$$q_{kk'} \equiv \frac{J_k \cdot J_{k'}}{|J_k||J_{k'}|} = \frac{1}{l_k l_{k'} N} \sum_{i=1}^{N} J_{ki} J_{k'i}, \tag{36}$$

where $k \neq k'$. When a student $J_k$ and another student $J_{k'}$ have no correlation, $q_{kk'} = 0$,
and $q_{kk'} = 1$ when the directions of $J_k$ and $J_{k'}$ agree. Therefore, $q_{kk'}$ is called the similarity
among students in the following, and $q_{kk'}$ is the third order parameter treated in this paper.
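The covariance between the internal potential $v$ of the teacher $B$ and the normalized internal
potential $u_k$ of a student $J_k$ equals the similarity $R_k$ between the teacher $B$ and the student $J_k$,
as follows,

\begin{align}
\langle v u_k \rangle &= \left\langle \frac{1}{l_k} \sum_{i=1}^{N} B_i x_i \sum_{j=1}^{N} J_{kj} x_j \right\rangle \tag{37} \\
&= \frac{1}{l_k} \sum_{i=1}^{N} \langle B_i J_{ki} \rangle \langle (x_i)^2 \rangle \tag{38} \\
&= \frac{1}{l_k N} \sum_{i=1}^{N} B_i J_{ki} \tag{39} \\
&= R_k. \tag{40}
\end{align}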

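The covariance between the normalized internal potential $u_k$ of a student $J_k$ and the normalized
internal potential $u_{k'}$ of another student $J_{k'}$ equals the similarity $q_{kk'}$ among students, as
follows,

\begin{align}
\langle u_k u_{k'} \rangle &= \left\langle \frac{1}{l_k l_{k'}} \sum_{i=1}^{N} J_{ki} x_i \sum_{j=1}^{N} J_{k'j} x_j \right\rangle \tag{41} \\
&= \frac{1}{l_k l_{k'}} \sum_{i=1}^{N} \langle J_{ki} J_{k'i} \rangle \langle (x_i)^2 \rangle \tag{42} \\
&= \frac{1}{l_k l_{k'} N} \sum_{i=1}^{N} J_{ki} J_{k'i} \tag{43} \\
&= q_{kk'}. \tag{44}
\end{align}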

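Therefore, Eq. (16) can be rewritten as

$$\epsilon_g = \int \prod_{k=1}^{K} du_k\, dv\, p(\{u_k\}, v)\, \epsilon(\{u_k\}, v), \tag{45}$$

$$p(\{u_k\}, v) = \frac{1}{(2\pi)^{\frac{K+1}{2}} |\Sigma|^{\frac{1}{2}}} \exp\left( -\frac{(\{u_k\}, v)\, \Sigma^{-1} (\{u_k\}, v)^T}{2} \right), \tag{46}$$

$$\Sigma = \begin{pmatrix}
1 & q_{12} & \cdots & q_{1K} & R_1 \\
q_{21} & 1 & & \vdots & \vdots \\
\vdots & & \ddots & q_{K-1,K} & \vdots \\
q_{K1} & \cdots & q_{K,K-1} & 1 & R_K \\
R_1 & \cdots & \cdots & R_K & 1
\end{pmatrix}. \tag{47}$$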
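As a quick consistency check of Eqs. (45)-(47), consider the simplest case $K = 1$ (this worked case is added for illustration and relies on the standard orthant probability of a bivariate Gaussian, not on a result stated in this section). Both ensemble rules then reduce to a single student, $\Sigma$ is the $2 \times 2$ matrix with off-diagonal element $R_1$, and the generalization error is the probability that two unit-variance Gaussian variables with correlation $R_1$ have opposite signs:

$$\epsilon_g \Big|_{K=1} = \int du_1\, dv\, p(u_1, v)\, \Theta(-u_1 v) = \frac{1}{\pi} \cos^{-1} R_1 .$$

This gives $\epsilon_g = 1/2$ for an uncorrelated student ($R_1 = 0$) and $\epsilon_g = 0$ when the student is parallel to the teacher ($R_1 = 1$).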



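As a result, the generalization error $\epsilon_g$ can be calculated if all similarities $R_k$ and $q_{kk'}$ are
obtained. Let us thus discuss the differential equations that describe the dynamical behaviors of
these order parameters. In this paper, the norms of the inputs, teacher and students are set as in
Eq. (7); the influence of the input can then be replaced with the average over the distribution of
inputs (sample average) in the large $N$ limit. This idea is called self-averaging in statistical mechanics.
Differential equations regarding $l_k$ and $R_k$ for general learning rules have been obtained based
on self-averaging as follows,

$$\frac{dl_k}{dt} = \langle f_k u_k \rangle + \frac{\langle f_k^2 \rangle}{2 l_k}, \tag{48}$$

$$\frac{dR_k}{dt} = \frac{\langle f_k v \rangle - \langle f_k u_k \rangle R_k}{l_k} - \frac{R_k}{2 l_k^2} \langle f_k^2 \rangle, \tag{49}$$

where $\langle \cdot \rangle$ stands for the sample average. That is,

$$\langle f_k u_k \rangle = \int du_k\, dv\, p_2(u_k, v)\, f(\mathrm{sgn}(v), u_k)\, u_k, \tag{50}$$

$$\langle f_k v \rangle = \int du_k\, dv\, p_2(u_k, v)\, f(\mathrm{sgn}(v), u_k)\, v, \tag{51}$$

$$\langle f_k^2 \rangle = \int du_k\, dv\, p_2(u_k, v)\, \left( f(\mathrm{sgn}(v), u_k) \right)^2, \tag{52}$$

$$p_2(u_k, v) = \frac{1}{2\pi |\Sigma_2|^{\frac{1}{2}}} \exp\left( -\frac{(u_k, v)\, \Sigma_2^{-1} (u_k, v)^T}{2} \right), \tag{53}$$

$$\Sigma_2 = \begin{pmatrix} 1 & R_k \\ R_k & 1 \end{pmatrix}. \tag{54}$$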



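Next, let us derive a differential equation regarding $q_{kk'}$ for the general learning rule.
Considering a student $J_k$ and another student $J_{k'}$ and rewriting as $l_k^m \to l_k$,
$l_k^{m+1} \to l_k + dl_k$, $q_{kk'}^m \to q_{kk'}$, $q_{kk'}^{m+1} \to q_{kk'} + dq_{kk'}$ and $1/N \to dt$,
a differential equation regarding $q$ is obtained as follows,

\begin{align}
\frac{dq_{kk'}}{dt} &= \frac{\langle f_{k'} u_k \rangle - q_{kk'} \langle f_{k'} u_{k'} \rangle}{l_{k'}}
+ \frac{\langle f_k u_{k'} \rangle - q_{kk'} \langle f_k u_k \rangle}{l_k} \nonumber \\
&\quad + \frac{\langle f_k f_{k'} \rangle}{l_k l_{k'}}
- \frac{q_{kk'}}{2} \left( \frac{\langle f_k^2 \rangle}{l_k^2} + \frac{\langle f_{k'}^2 \rangle}{l_{k'}^2} \right), \tag{55}
\end{align}

from the update rule of Eq. (9), the definition of $q_{kk'}$ in Eq. (36) and self-averaging, where

$$\langle f_k u_{k'} \rangle = \int du_k\, du_{k'}\, dv\, p_3(u_k, u_{k'}, v)\, f(\mathrm{sgn}(v), u_k)\, u_{k'}, \tag{56}$$

$$\langle f_{k'} u_k \rangle = \int du_k\, du_{k'}\, dv\, p_3(u_k, u_{k'}, v)\, f(\mathrm{sgn}(v), u_{k'})\, u_k, \tag{57}$$

$$\langle f_k f_{k'} \rangle = \int du_k\, du_{k'}\, dv\, p_3(u_k, u_{k'}, v)\, f(\mathrm{sgn}(v), u_k)\, f(\mathrm{sgn}(v), u_{k'}), \tag{58}$$

$$p_3(u_k, u_{k'}, v) = \frac{1}{(2\pi)^{\frac{3}{2}} |\Sigma_3|^{\frac{1}{2}}} \exp\left( -\frac{(u_k, u_{k'}, v)\, \Sigma_3^{-1} (u_k, u_{k'}, v)^T}{2} \right), \tag{59}$$

$$\Sigma_3 = \begin{pmatrix} 1 & q_{kk'} & R_k \\ q_{k'k} & 1 & R_{k'} \\ R_k & R_{k'} & 1 \end{pmatrix}. \tag{60}$$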



IV. RESULT 



A. Conditions of analytical calculations 

As described above, in this paper each component of the initial values $J_k^0$ of the students and of the
teacher $B$ is generated independently according to the Gaussian distribution $\mathcal{N}(0, 1)$, and
the thermodynamic limit $N \to \infty$ is considered. Therefore, all $J_k^0$ and $B$ are orthogonal to
each other. That is,

$$R_k^0 = 0, \quad q_{kk'}^0 = 0. \tag{61}$$

From Eq. (61) and the symmetry of students, we can write

$$\langle f_k u_{k'} \rangle = \langle f_{k'} u_k \rangle, \quad \langle f_k f_{k'} \rangle = \langle f_{k'} f_k \rangle \tag{62}$$

in Eq. (55). From Eq. (61) and the symmetry among students, we omit the subscripts $k, k'$ from the
order parameters $l_k$, $R_k$ and $q_{kk'}$ in Eqs. (48)-(60) and write them as $l$, $R$ and $q$. In the
following sections, we analytically obtain the five sample averages $\langle f_k u_k \rangle$, $\langle f_k v \rangle$,
$\langle f_k^2 \rangle$, $\langle f_k u_{k'} \rangle$ and $\langle f_k f_{k'} \rangle$ concretely, which are necessary to solve
Eqs. (48), (49) and (55) for the typical learning rules under the conditions given in Eqs. (61) and (62).
$R$ and $q$ are then obtained by solving Eqs. (48), (49) and (55) numerically with these sample averages.
We obtain the ensemble generalization error $\epsilon_g$ numerically by evaluating Eq. (45) with the obtained
$R$ and $q$.



B. Hebbian learning 

The update procedure for Hebbian learning is

$$f(\mathrm{sgn}(v), u) = \mathrm{sgn}(v). \tag{63}$$

Using this expression, $\langle f_k u_k \rangle$, $\langle f_k v \rangle$ and $\langle f_k^2 \rangle$ in the case of Hebbian learning
can be obtained as follows by executing Eqs. (50)-(52) analytically,

$$\langle f_k u_k \rangle = \frac{2R}{\sqrt{2\pi}}, \quad \langle f_k v \rangle = \sqrt{\frac{2}{\pi}}, \quad \langle f_k^2 \rangle = 1. \tag{64}$$

In this section, $\langle f_k u_{k'} \rangle$ and $\langle f_k f_{k'} \rangle$ are derived. Since Eq. (63) is independent of $u$, we
obtain

$$\langle f_k u_{k'} \rangle = \langle f_k u_k \rangle = \frac{2R}{\sqrt{2\pi}}, \tag{65}$$

$$\langle f_k f_{k'} \rangle = \langle (\mathrm{sgn}(v))^2 \rangle = 1. \tag{66}$$

$R$ and $q$ have been obtained by solving Eqs. (48), (49), (55), (61), (62), (64)-(66)
numerically. We have obtained numerical ensemble generalization errors $\epsilon_g$ in the case of
$K = 3$ by using Eqs. (45)-(47) and the above $R$ and $q$. Figure 1 shows the results. In
this figure, MV and WM indicate the majority vote and the weight mean, respectively.
Numerical integrations of Eq. (45) in the theoretical calculations have been executed by using
the six-point closed Newton-Cotes formula. In the computer simulation, N = 10^ and
ensemble generalization errors have been obtained through tests using 10^ random inputs at
each time step. In this figure, the result of the theoretical calculation for $K = 1$ is also shown to
clarify the effect of the ensemble. This figure shows that the ensemble generalization errors
obtained by theoretical calculation explain the computer simulation quantitatively.
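To make the numerical solution step concrete, the following sketch integrates the order-parameter equations for Hebbian learning. It is an illustration under stated assumptions rather than the procedure actually used for the figures: it inserts the Hebbian averages of Eqs. (64)-(66) into Eqs. (48), (49) and (55) with the student subscripts dropped, starts from $l = 1$, $R = 0$, $q = 0$ (Eqs. (7) and (61)), and uses a hand-written fourth-order Runge-Kutta step with a step size chosen here.

```python
import numpy as np

def hebbian_rhs(y):
    """dl/dt, dR/dt, dq/dt from Eqs. (48), (49), (55) with Eqs. (64)-(66)."""
    l, R, q = y
    fu = 2.0 * R / np.sqrt(2.0 * np.pi)   # <f u> = <f u'>, Eqs. (64), (65)
    fv = np.sqrt(2.0 / np.pi)             # <f v>, Eq. (64)
    f2 = 1.0                              # <f^2> = <f f'>, Eqs. (64), (66)
    dl = fu + f2 / (2.0 * l)
    dR = (fv - fu * R) / l - R * f2 / (2.0 * l**2)
    # Symmetric form of Eq. (55): both cross terms are equal by Eq. (62).
    dq = 2.0 * (fu - q * fu) / l + (f2 - q * f2) / l**2
    return np.array([dl, dR, dq])

def integrate(rhs, y0, t_max, dt=1e-3):
    """Classical RK4 integration from t = 0 to t = t_max."""
    y = np.array(y0, dtype=float)
    for _ in range(int(round(t_max / dt))):
        k1 = rhs(y)
        k2 = rhs(y + 0.5 * dt * k1)
        k3 = rhs(y + 0.5 * dt * k2)
        k4 = rhs(y + dt * k3)
        y = y + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return y

if __name__ == "__main__":
    l, R, q = integrate(hebbian_rhs, [1.0, 0.0, 0.0], t_max=10.0)
    print(f"t=10: l={l:.4f}, R={R:.4f}, q={q:.4f}")
```

The other two learning rules can be handled by the same integrator once their five sample averages are supplied in place of the Hebbian ones.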

Figures 2 and 3 show the results of computer simulations where N = 10^, $K = 1, 3, 11, 31$ until
$t = 10^4$ in order to investigate the asymptotic behaviors of the generalization errors. The asymptotic
behavior of the generalization error in Hebbian learning when the number $K$ of students
is unity is $O(t^{-1/2})$. The asymptotic orders of the generalization error in the case of ensemble
learning are considered equal to those of $K = 1$, since the curves for $K = 3, 11, 31$ are parallel
to those of $K = 1$ in these figures.

FIG. 1: Dynamical behaviors of ensemble generalization error $\epsilon_g$ in Hebbian learning.
(Curves: K=1 (Theory); K=3, MV (Theory); K=3, MV (simulation); K=3, WM (Theory);
K=3, WM (simulation). Horizontal axis: Time t = m/N.)



FIG. 2: Asymptotic behavior of generalization error of majority vote in Hebbian learning.
Computer simulations, except for the solid line. Asymptotic order of ensemble learning is the same as
that at K = 1. (Curves: K=1, Theory; K=1; K=3 (MV); K=11 (MV); K=31 (MV). Horizontal axis:
Time t = m/N, logarithmic.)



FIG. 3: Asymptotic behavior of generalization error of weight mean in Hebbian learning. Computer
simulations, except for the solid line. Asymptotic order of ensemble learning is the same as that
at K = 1. (Horizontal axis: Time t = m/N, logarithmic.)

To clarify the relationship between $K$ and the effect of the ensemble, we have obtained
theoretical ensemble generalization errors for various values of $K$. Here, it is difficult to
execute the numerical integration of Eq. (45) when $K > 3$ by the Newton-Cotes formula used
in the calculations for Figure 1. Therefore, the Metropolis method, which is a type of
Monte Carlo method, has been used. We then orthogonalized the variables of integration to
eliminate the calculation of inverse matrices of Eq. (47). That is,

$$u_k = a \bar{u}_k + b u + c v, \quad k = 1, 2, \cdots, K, \tag{67}$$

where $u_k$, $\bar{u}_k$, $u$ and $v$ obey the Gaussian distribution $\mathcal{N}(0, 1)$ and $\bar{u}_k$, $u$ and $v$
have no correlation with each other. Considering that the subscripts $k, k'$ have been omitted from the
order parameters $R_k$, $q_{kk'}$ and Eq. (47), the conditions that $a$, $b$ and $c$ must satisfy are

$$a^2 + b^2 + c^2 = 1, \tag{68}$$

$$b^2 + c^2 = q, \tag{69}$$

$$c = R. \tag{70}$$

Therefore,

$$a = \sqrt{1 - q}, \tag{71}$$

$$b = \sqrt{q - R^2}, \tag{72}$$

$$c = R. \tag{73}$$

By using these $a$, $b$ and $c$, we can rewrite Eqs. (45)-(47) as follows:

$$\epsilon_g = \int \prod_{k=1}^{K} d\bar{u}_k\, p_1(\bar{u}_k)\, du\, p_1(u)\, dv\, p_1(v)\, \epsilon(\{a \bar{u}_k + b u + c v\}, v), \tag{74}$$

$$p_1(u) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{u^2}{2} \right). \tag{75}$$

These operations orthogonalized the variables of integration in exchange for their number
having been increased from $K + 1$ to $K + 2$. The multivariate Gaussian distribution function
$p(\{u_k\}, v)$ can be rewritten as a product of simple Gaussian distribution functions $p_1(\cdot)$ by this
orthogonalization. Thus, calculations of inverse matrices of Eq. (47) become unnecessary.
These facts have made it easy to perform the numerical calculations of the generalization
error for a large $K$.
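The orthogonalized form lends itself to direct sampling. The sketch below estimates Eq. (74) by plain Monte Carlo sampling of the independent Gaussian variables; note that this is a simpler substitute chosen for illustration, not the Metropolis method used in the paper. The function name, sample count and seed are assumptions, and the weight mean is evaluated from the sum of the $u_k$, which presumes equal student lengths.

```python
import numpy as np

def ensemble_error_mc(R, q, K, n_samples=200_000, seed=0):
    """Estimate Eq. (74) for the majority vote and the weight mean.

    Returns (error_majority_vote, error_weight_mean).
    """
    if not (R * R <= q <= 1.0):
        raise ValueError("need R^2 <= q <= 1 so that a and b are real")
    a, b, c = np.sqrt(1.0 - q), np.sqrt(q - R * R), R       # Eqs. (71)-(73)
    rng = np.random.default_rng(seed)
    u_bar = rng.standard_normal((n_samples, K))   # one independent term per student
    u_common = rng.standard_normal((n_samples, 1))  # term shared by all students
    v = rng.standard_normal((n_samples, 1))       # teacher potential
    u = a * u_bar + b * u_common + c * v          # Eq. (67)
    teacher = np.where(v[:, 0] >= 0, 1.0, -1.0)
    votes = np.where(u >= 0, 1.0, -1.0).sum(axis=1)
    mv = np.where(votes >= 0, 1.0, -1.0)          # majority vote, Eq. (14)
    wm = np.where(u.sum(axis=1) >= 0, 1.0, -1.0)  # weight mean, Eq. (15)
    return float(np.mean(mv != teacher)), float(np.mean(wm != teacher))

if __name__ == "__main__":
    for K in (1, 3, 11):
        print(K, ensemble_error_mc(R=0.8, q=0.7, K=K))
```

Because every variable is an independent standard Gaussian, no covariance matrix has to be built or inverted, which is the point of the orthogonalization.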

Figure 4 shows the results obtained by the Metropolis method using the values of $R$ and
$q$ calculated numerically for Hebbian learning and Eqs. (71)-(75). Calculations have been
executed for $K = 1, 3, 5, 7, 9, 11, 13, 21, 31$ and $51$ in both the majority vote (MV) and
the weight mean (WM). The number of Monte Carlo steps is 10^. These theoretical results
are fitted by two quadratic curves. In this figure, the results of computer simulations where
N = 10^, $K = 1, 3, 5, 7, 9, 11, 13, 21, 31$ and $51$ have also been drawn for comparison with
the theoretical calculations. In the computer simulations, ensemble generalization errors
have been obtained through tests using 10^ random inputs. The figure shows the values at
$t = 50$ for both theoretical calculations and computer simulations; judging from Figures 2 and 3,
the learning is considered to be sufficiently within the asymptotic region at this time.
Here, since the relationship between $1/K$ and the ensemble generalization
error is a straight line in the case of linear perceptrons, the abscissa is $1/K$ in Figure 4.
The ordinates have been normalized by the theoretical ensemble generalization error of
$K = 1$ at $t = 50$.

FIG. 4: Relationship between K and effect of ensemble in Hebbian learning. Ensemble generalization
error $\epsilon_g$ for a large K limit is about 0.99 times that of K = 1. (Curves: Hebb, MV, Theory;
Hebb, MV, Simulation; Hebb, WM, Theory; Hebb, WM, Simulation. Horizontal axis: 1/K.)

C. Perceptron learning



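The update procedure for perceptron learning is

$$f(\mathrm{sgn}(v), u) = \Theta(-uv)\, \mathrm{sgn}(v). \tag{76}$$

Using this expression, $\langle f_k u_k \rangle$, $\langle f_k v \rangle$ and $\langle f_k^2 \rangle$ in the case of perceptron learning
can be obtained as follows by executing Eqs. (50)-(52) analytically,

$$\langle f_k u_k \rangle = \frac{R - 1}{\sqrt{2\pi}}, \quad \langle f_k v \rangle = \frac{1 - R}{\sqrt{2\pi}}, \tag{77}$$

$$\langle f_k^2 \rangle = \frac{1}{\pi} \tan^{-1} \frac{\sqrt{1 - R^2}}{R}. \tag{78}$$

In this section, $\langle f_k u_{k'} \rangle$ and $\langle f_k f_{k'} \rangle$ are derived. Using Eq. (76), $\langle f_k u_{k'} \rangle$ and
$\langle f_k f_{k'} \rangle$ in the case of perceptron learning are obtained as follows by executing Eqs. (56) and (58)
analytically.

\begin{align}
\langle f_k u_{k'} \rangle &= \int du_k\, du_{k'}\, dv\, p_3(u_k, u_{k'}, v)\, \Theta(-u_k v)\, \mathrm{sgn}(v)\, u_{k'} \nonumber \\
&= \frac{R - q}{\sqrt{2\pi}}, \tag{79} \\
\langle f_k f_{k'} \rangle &= \int du_k\, du_{k'}\, dv\, p_3(u_k, u_{k'}, v)\, \Theta(-u_k v)\, \Theta(-u_{k'} v) \nonumber \\
&= 2 \int_0^{\infty} Dv \int_{\frac{Rv}{\sqrt{1 - R^2}}}^{\infty} Dx\, H(z), \tag{80}
\end{align}

where

$$z \equiv \frac{-(q - R^2)\, x + R \sqrt{1 - R^2}\, v}{\sqrt{(1 - q)(1 + q - 2R^2)}}, \tag{81}$$

and the definitions of $H(u)$ and $Dx$ are

$$H(u) \equiv \int_u^{\infty} Dx, \tag{82}$$

$$Dx \equiv \frac{dx}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right). \tag{83}$$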

In the same manner as Hebbian learning, $R$ and $q$ have been obtained by solving Eqs.
(48), (49), (55), (61), (62), (77)-(80) numerically. We have obtained numerical ensemble
generalization errors $\epsilon_g$ in the case of $K = 3$ by using Eqs. (45)-(47) and the above $R$ and
$q$. Figure 5 shows the results. This figure shows that the ensemble generalization errors
obtained by theoretical calculation explain the computer simulation quantitatively.
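The closed-form sample averages for perceptron learning can be sanity-checked by sampling. The sketch below draws correlated Gaussian potentials with $\langle u_k v \rangle = R$ and $\langle u_k u_{k'} \rangle = q$ and compares the empirical averages of the perceptron rule $f = \Theta(-uv)\,\mathrm{sgn}(v)$ with the expressions $(R-1)/\sqrt{2\pi}$, $(1-R)/\sqrt{2\pi}$, $\frac{1}{\pi}\tan^{-1}(\sqrt{1-R^2}/R)$ and $(R-q)/\sqrt{2\pi}$. The function names, sample size and seed are hypothetical choices made for this illustration.

```python
import numpy as np

def sample_potentials(R, q, n, seed=0):
    """Draw (u_k, u_k', v): unit variances, <u v> = R, <u_k u_k'> = q (q >= R^2)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    common = rng.standard_normal(n)
    a, b = np.sqrt(1.0 - q), np.sqrt(q - R * R)
    u1 = a * rng.standard_normal(n) + b * common + R * v
    u2 = a * rng.standard_normal(n) + b * common + R * v
    return u1, u2, v

def perceptron_averages_mc(R, q, n=400_000, seed=0):
    """Empirical <f u>, <f v>, <f^2>, <f_k u_k'> for the perceptron rule."""
    u1, u2, v = sample_potentials(R, q, n, seed)
    f1 = np.where(-u1 * v >= 0, np.sign(v), 0.0)   # f = Theta(-u v) sgn(v)
    return {"fu": float(np.mean(f1 * u1)), "fv": float(np.mean(f1 * v)),
            "f2": float(np.mean(f1 ** 2)), "fu_other": float(np.mean(f1 * u2))}

def perceptron_averages_closed(R, q):
    """Closed forms; arctan2 keeps the inverse tangent on the branch (0, pi)."""
    s = np.sqrt(2.0 * np.pi)
    return {"fu": (R - 1.0) / s, "fv": (1.0 - R) / s,
            "f2": float(np.arctan2(np.sqrt(1.0 - R * R), R) / np.pi),
            "fu_other": (R - q) / s}

if __name__ == "__main__":
    print(perceptron_averages_mc(0.6, 0.5))
    print(perceptron_averages_closed(0.6, 0.5))
```

The two dictionaries should agree up to sampling noise, which shrinks as the number of samples grows.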



FIG. 5: Dynamical behaviors of ensemble generalization error $\epsilon_g$ in perceptron learning.
(Horizontal axis: Time t = m/N.)

Figures 6 and 7 show the results of computer simulations where N = 10^, $K = 1, 3, 11, 31$
until $t = 10^4$ in order to investigate the asymptotic behaviors of the generalization errors. The effect
of the ensemble is maintained asymptotically. The asymptotic behavior of the generalization error in
perceptron learning when the number $K$ of students is unity is $O(t^{-1/3})$. The asymptotic
orders of the generalization error in the case of ensemble learning are considered equal to
those of $K = 1$, since the curves for $K = 3, 11, 31$ are parallel to those of $K = 1$ in these
figures.

To clarify the relationship between $K$ and the effect of the ensemble, we have obtained
theoretical ensemble generalization errors for various values of $K$. In the same manner as
Hebbian learning, Figure 8 shows the results obtained by the Metropolis method using the
values of $R$ and $q$ calculated numerically for perceptron learning and Eqs. (71)-(75).

FIG. 6: Asymptotic behavior of generalization error of majority vote in perceptron learning.
Computer simulations, except for the solid line. Asymptotic order of ensemble learning is the same as
that at K = 1.

FIG. 7: Asymptotic behavior of generalization error of weight mean in perceptron learning.
Computer simulations, except for the solid line. Asymptotic order of ensemble learning is the same as
that at K = 1.

FIG. 8: Relationship between K and effect of ensemble in perceptron learning. Ensemble
generalization error $\epsilon_g$ for a large K limit is about 0.72 times that of K = 1.

D. AdaTron learning

The update procedure for AdaTron learning is

$$f(\mathrm{sgn}(v), u) = -u\, \Theta(-uv). \tag{84}$$

Using this expression, $\langle f_k u_k \rangle$, $\langle f_k v \rangle$ and $\langle f_k^2 \rangle$ in the case of AdaTron learning
can be obtained as follows by executing Eqs. (50)-(52) analytically,

\begin{align}
\langle f_k u_k \rangle &= -2 \int_0^{\infty} Du\, u^2\, H\left( \frac{Ru}{\sqrt{1 - R^2}} \right) \tag{85} \\
&= -\frac{1}{\pi} \cot^{-1} \left( \frac{R}{\sqrt{1 - R^2}} \right) + \frac{1}{\pi} R \sqrt{1 - R^2}, \tag{86} \\
\langle f_k v \rangle &= \frac{(1 - R^2)^{\frac{3}{2}}}{\pi} + R \langle f_k u_k \rangle, \tag{87} \\
\langle f_k^2 \rangle &= -\langle f_k u_k \rangle. \tag{88}
\end{align}

In this section, $\langle f_k u_{k'} \rangle$ and $\langle f_k f_{k'} \rangle$ are derived. Using Eq. (84), $\langle f_k u_{k'} \rangle$ and
$\langle f_k f_{k'} \rangle$ in the case of AdaTron learning are obtained by executing Eqs. (56) and (58)
analytically; the evaluated form of Eq. (90) is a combination of Gaussian integrals over $v$ and $x$ that
contain $H(z)$.

\begin{align}
\langle f_k u_{k'} \rangle &= -\int du_k\, du_{k'}\, dv\, p_3(u_k, u_{k'}, v)\, \Theta(-u_k v)\, u_k u_{k'} \nonumber \\
&= \frac{1 + q}{\pi} R \sqrt{1 - R^2} - 2q \int_0^{\infty} Dv \int_{\frac{Rv}{\sqrt{1 - R^2}}}^{\infty} Dx\, x^2, \tag{89} \\
\langle f_k f_{k'} \rangle &= \int dv\, du_k\, u_k\, du_{k'}\, u_{k'}\, p_3(u_k, u_{k'}, v)\, \Theta(-u_k v)\, \Theta(-u_{k'} v), \tag{90}
\end{align}



where the definitions of $z$, $H(u)$ and $Dx$ are given in Eqs. (81), (82) and (83), respectively.

In the same manner as Hebbian learning, $R$ and $q$ have been obtained by solving Eqs.
(48), (49), (55), (61), (62), (85)-(90) numerically. We have obtained numerical ensemble
generalization errors $\epsilon_g$ in the case of $K = 3$ by using Eqs. (45)-(47) and the above $R$ and
$q$. Figure 9 shows the results. This figure shows that the ensemble generalization errors
obtained by theoretical calculation explain the computer simulation quantitatively.

FIG. 9: Dynamical behaviors of ensemble generalization error $\epsilon_g$ in AdaTron learning. Improvement
of $\epsilon_g$ by increasing K from 1 to 3 is largest of the three learning rules.

Figures 10 and 11 show the results of computer simulations where N = 10^, $K = 1, 3, 11, 31$
until $t = 10^4$ in order to investigate the asymptotic behaviors of the generalization errors. The effect
of the ensemble is maintained asymptotically. The asymptotic behavior of the generalization error in
AdaTron learning when the number $K$ of students is unity is $O(t^{-1})$ [9, 12]. The asymptotic
orders of the generalization error in the case of ensemble learning are considered equal to
those of $K = 1$, since the curves for $K = 3, 11, 31$ are parallel to those of $K = 1$ in these
figures.

FIG. 10: Asymptotic behavior of generalization error of majority vote in AdaTron learning.
Computer simulations, except for the solid line. Asymptotic order of ensemble learning is the same as
that at K = 1.

FIG. 11: Asymptotic behavior of generalization error of weight mean in AdaTron learning.
Computer simulations, except for the solid line. Asymptotic order of ensemble learning is the same as
that at K = 1.

To clarify the relationship between $K$ and the effect of the ensemble, we have obtained
theoretical ensemble generalization errors for various values of $K$. In the same manner as
Hebbian learning, Figure 12 shows the results obtained by the Metropolis method using the
values of $R$ and $q$ calculated numerically for AdaTron learning and Eqs. (71)-(75).

FIG. 12: Relationship between K and effect of ensemble in AdaTron learning. Ensemble
generalization error $\epsilon_g$ for a large K limit is about 0.68 times that of K = 1.



V. DISCUSSION 

Figures 1-12 show that the generalization errors of the three learning
rules are all improved by ensemble learning. However, the degree of improvement is small
in Hebbian learning and large in AdaTron learning. First, we discuss the reason for this
difference in the following.

Each student moves towards the teacher as learning proceeds. Therefore, the similarities $R_k$
and $q_{kk'}$ increase and approach unity, and $R_k$ and $q_{kk'}$ become increasingly constrained by
each other. For example, when $R_k = R_{k'} = 1$, $q_{kk'}$ cannot differ from 1, since the teacher $B$, the
student $J_k$ and the other student $J_{k'}$ all have the same direction. Thus, $R_k$ and $q_{kk'}$ are subject
to a mutual constraint. When $q_{kk'}$ is small relative to $R_k$, variety among students is better
maintained and the effect of the ensemble can be considered large. On the contrary, once $q_{kk'}$
becomes unity, student $J_k$ and student $J_{k'}$ are identical and there is no merit in combining them.

Let us explain these considerations intuitively by using Figure 13. Both (a) and (b) show
the relationship among two students J1, J2 and a teacher B when learning has proceeded
to some degree from the condition that the students and the teacher have no correlation.
Then, as shown in Figure 13, the students must be distributed at points the same distance from the
teacher. That is, the similarity R1 of the teacher and student J1 equals the similarity
R2 of the teacher and student J2 in both (a) and (b). Here, (a) shows the case in which
the students are unlike each other; in other words, the variety among students is large, that is,
q is small. In this case, the mean vector of J1 and J2 is clearly closer to the teacher
B than either J1 or J2. Therefore, the mean vector (1/K) Σ_{k=1}^{K} Jk of the students' connection
weights can closely approximate the connection weight vector B of the teacher in cases like
(a). In addition, a combination method other than the mean of the students, e.g. the majority
vote of the students, is also expected to approximate the teacher better than each student can alone in
cases like (a). In this case, the effect of ensemble learning is strong. On the contrary,
Figure 13(b) shows the case in which the students are similar to each other; in other words,
the variety among students is small, meaning q is large. In this case, the significance of
combining two students is small since their outputs are almost always the same. Therefore, the
effect of ensemble learning is small when q is large, as in Figure 13(b). Thus, the relationship
between Rk and qkk' is essential to know in ensemble learning.
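The geometric argument can be checked with a short calculation. The sketch below is an illustration under assumptions that the text does not state in this form: all K students are unit vectors, every student has the same overlap R with the teacher, and every pair of students has the same overlap q. The mean vector then has squared length (1 + (K - 1) q) / K, so its overlap with the teacher is R sqrt(K / (1 + (K - 1) q)). The generalization error of a sign perceptron is taken as arccos(overlap) / π, the standard result for isotropic inputs. The function names are hypothetical.

```python
import math

def ensemble_overlap(r: float, q: float, k: int) -> float:
    """Overlap between the teacher and the mean of k unit-norm students.

    Assumes a symmetric configuration: each student has overlap r with the
    teacher and each pair of students has overlap q.
    """
    norm_sq = (1.0 + (k - 1) * q) / k   # squared length of the mean vector
    return r / math.sqrt(norm_sq)

def generalization_error(overlap: float) -> float:
    """Disagreement probability of two sign perceptrons with this overlap."""
    return math.acos(max(-1.0, min(1.0, overlap))) / math.pi

# Two students with R = 0.6: the smaller q is, the closer the mean
# vector lies to the teacher, as in panel (a) versus panel (b).
for q in (0.4, 0.7, 1.0):
    r_mean = ensemble_overlap(0.6, q, k=2)
    print(q, round(r_mean, 3), round(generalization_error(r_mean), 3))
```

With q = 1 the mean vector coincides with each student and nothing is gained, which matches the case of Figure 13(b).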




FIG. 13: Variety among students, shown in panels (a) and (b).



Figures 14, 15 and 16 compare, for Hebbian learning, perceptron learning and AdaTron
learning respectively, the theoretical results for the dynamical behaviors of R and q,
obtained by numerically solving the differential equations derived above for each rule,
with computer simulations (N = 10^…). In these
figures, the theoretical results and the computer simulations closely agree with each other.
That is, the derived theory explains the computer simulations quantitatively. Figure 14 shows
that q rises more rapidly than R in Hebbian learning; in other words, q is large
relative to R, meaning the variety among students disappears rapidly in Hebbian
learning. Figure 15 shows that q is smaller than R in the early period of learning (t < 4.0),
which means perceptron learning maintains the variety among students for a longer time
than Hebbian learning. Figure 16 shows that q is smaller relative to
R than in the cases of Hebbian learning and perceptron learning. This means AdaTron
learning maintains the variety among students best of these three learning rules.
FIG. 14: Dynamical behaviors of R and q in Hebbian learning. Here, q rises more rapidly than R,
which means the variety among students disappears rapidly in Hebbian learning. 



Figures 14-16 show that q is smaller relative to R in the case of
AdaTron learning than in Hebbian learning and perceptron learning. As described before,
the relationship between R and q is essential in ensemble learning. To illustrate this, Figure
17 shows the relationship more clearly by taking R and q as axes. In this figure, the curve
for AdaTron learning lies at the bottom. That is, of the three learning rules,
AdaTron learning offers the smallest q for a given R. In other words,
q rises most slowly and the variety among students is
maintained best in AdaTron learning.
FIG. 15: Dynamical behaviors of R and q in perceptron learning. Here, q is smaller than R in the
early period of learning (t < 4.0). Perceptron learning maintains the variety among students for a 
longer time than Hebbian learning. 
FIG. 16: Dynamical behaviors of R and q in AdaTron learning. Here, q is smaller relative to
R than in the cases of Hebbian learning or perceptron learning. AdaTron learning
maintains the variety among students best of these three learning rules.

These characteristics can be understood from the update expression of each rule. The
update equation of Hebbian learning means that an update depends only on the output sgn(v)
of the teacher. That is, all students receive the identical update at every time step. Therefore,
the similarity among students increases rapidly in Hebbian learning. On the other hand, the
update by perceptron learning equals that of Hebbian learning times Θ(−uv), as its
update equation shows. Only students whose outputs are opposite to that of the teacher change their connection
weights. At least in the initial period of learning, some students give an output opposite to that
of the teacher and others give the same output as the teacher. As a
result, some students change their connection weights and others do not,
so the variety among students is better maintained by perceptron
learning than by Hebbian learning. The update by AdaTron learning, given earlier,
can be rewritten as f(sgn(v), u) = |u| Θ(−uv) sgn(v). That is, the
update by AdaTron learning equals that of perceptron learning times |u|, which depends
on the student. Therefore, the variety among students is maintained still better
by AdaTron learning.
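The argument above can be tried numerically. The following Python sketch is only an illustration and makes several assumptions that are not taken from the paper: unit learning rate, Gaussian inputs with variance 1/N per component, independent Gaussian initial students, K = 2 and N = 500. The function names are hypothetical. The AdaTron rule is written in the standard form −u Θ(−uv), which is the same as |u| Θ(−uv) sgn(v).

```python
import numpy as np

def hebbian(u, v):
    """Same update for every student: depends only on the teacher output."""
    return np.sign(v) * np.ones_like(u)

def perceptron(u, v):
    """Hebbian update times Theta(-u v): only wrong students are updated."""
    return (u * v < 0) * np.sign(v)

def adatron(u, v):
    """-u Theta(-u v), i.e. the perceptron update times |u|."""
    return -u * (u * v < 0)

def simulate(rule, n=500, k=2, t_max=5.0, seed=0):
    """Online learning of k students; returns (mean R, q between students 0 and 1)."""
    rng = np.random.default_rng(seed)
    b = rng.standard_normal(n)                    # teacher weights
    j = rng.standard_normal((k, n))               # uncorrelated initial students
    for _ in range(int(t_max * n)):               # time t = steps / n
        x = rng.standard_normal(n) / np.sqrt(n)   # one shared input per step
        v = b @ x                                 # teacher internal potential
        u = j @ x                                 # student internal potentials
        j += rule(u, v)[:, None] * x
    jn = j / np.linalg.norm(j, axis=1, keepdims=True)
    r = jn @ (b / np.linalg.norm(b))
    return r.mean(), jn[0] @ jn[1]

for rule in (hebbian, perceptron, adatron):
    r, q = simulate(rule)
    print(f"{rule.__name__:10s} R = {r:.3f}  q = {q:.3f}")
```

Because the Hebbian update is identical for all students, the part of each weight vector acquired during learning is shared, which is why q is expected to exceed R for that rule in this sketch.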




FIG. 17: Relationship between R and q (Theory); the horizontal axis is the overlap R. Here, q of
AdaTron learning is the smallest relative to R. q rises most slowly and the variety among
students is best maintained in AdaTron learning.

In the discussion above, we have explained why the degree of improvement by ensemble learning
is small in Hebbian learning and large in AdaTron learning, as shown in the figures of the
ensemble generalization error. AdaTron learning originally features the fastest asymptotic
characteristic of the three learning rules [3]. However, it has the disadvantage that learning is
slow at the beginning; that is, the generalization error is larger than for the other two learning
rules in the period t < 6. This paper shows that the fastest asymptotic characteristic
of AdaTron learning is maintained in ensemble learning, that AdaTron learning has a
good affinity with ensemble learning in regard to "the variety among students," and that the
disadvantage of the early period can be mitigated by combining it with ensemble learning.
From the perspective of the difference between the majority vote and the weight mean,
the figures of the ensemble generalization error show that the improvement by the weight mean is larger than
that by the majority vote in all three learning rules. Improvement in the generalization error
by averaging the connection weights of various students can be understood intuitively because
the mean of the students is close to the teacher in Figure 13(a). The
improvement by the majority vote is considered smaller than that by the weight mean
because the majority vote cannot utilize the variety among students as effectively as
the weight mean. However, the majority vote can determine an ensemble output using only
the outputs of the students, and is easy to implement. It is, therefore, significant that the
effect of an ensemble in the case of the majority vote has been analyzed quantitatively.
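The two combination methods can be written down compactly. The following Python sketch is an illustration, not the paper's procedure: the function names are hypothetical, the weight mean is taken as the plain average of the students' weight vectors without renormalization, the test inputs are Gaussian, and K is assumed odd so that the vote cannot tie. The error is estimated by Monte Carlo sampling rather than by the analytical method of the paper.

```python
import numpy as np

def majority_vote(students, x):
    """Sign of the sum of the students' individual sign outputs."""
    return np.sign(np.sign(students @ x).sum())

def weight_mean(students, x):
    """Output of one perceptron whose weights are the mean of the students'."""
    return np.sign(students.mean(axis=0) @ x)

def ensemble_error(combine, teacher, students, n_test=2000, seed=0):
    """Monte Carlo estimate of the disagreement rate with the teacher."""
    rng = np.random.default_rng(seed)
    xs = rng.standard_normal((n_test, teacher.size))
    wrong = sum(combine(students, x) != np.sign(teacher @ x) for x in xs)
    return wrong / n_test

# Hypothetical example: K = 3 students built as noisy copies of the teacher.
rng = np.random.default_rng(1)
teacher = rng.standard_normal(50)
students = teacher + 1.5 * rng.standard_normal((3, 50))
for combine in (majority_vote, weight_mean):
    print(combine.__name__, ensemble_error(combine, teacher, students))
```

The majority vote needs only the students' outputs, whereas the weight mean needs access to their connection weights, which is the practical trade-off noted above.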

The figures of the ensemble generalization error also show that the ensemble generalization errors ε_g by the majority
vote are larger than those by the weight mean in the case of K < ∞. In both perceptron
learning and AdaTron learning, the relationship between 1/K and ε_g is a straight line
in the case of the weight mean and an upwards-convex curve in the case of the majority vote.
The ensemble generalization errors ε_g in the cases of the majority vote and the weight
mean agree with each other at a large K limit. This fact agrees with the description in p. 
Therefore, the weight mean is superior to the majority vote, especially in the case of a
small K. Moreover, it is shown that ε_g in the large-K limit is about 0.99, 0.72 and 0.68
times that of K = 1 in Hebbian, perceptron and AdaTron learning, respectively.
This confirms that the ensemble has the strongest effect in AdaTron learning among
the three learning rules.



VI. CONCLUSION 



This paper discussed ensemble learning of K nonlinear perceptrons, which determine their
outputs by sign functions, within the framework of online learning and statistical mechanics.
One purpose of statistical learning theory is to theoretically obtain the generalization error. 
In this paper, we have shown that the ensemble generalization error can be calculated by 
using two order parameters, that is, the similarity between the teacher and a student, and the
similarity among students. The differential equations that describe the dynamical behaviors 
of these order parameters have been derived in the case of general learning rules. The 
concrete forms of these differential equations have been derived analytically in the cases 
of three well-known rules: Hebbian learning, perceptron learning and AdaTron learning.
We calculated the ensemble generalization errors of these three rules by using the results 
determined by solving their differential equations. As a result, these three rules have different 
characteristics in their affinity for ensemble learning, that is, "maintaining variety among 
students." The results show that AdaTron learning is superior to the other two rules with 
respect to that affinity. 